The Internet Stopped

Seven days ago, many services on the web went down. Chances are, you were using one of those services at the time. You probably started wondering if your internet was going wonky. Maybe you rebooted your device. Maybe you even rebooted your modem and router. But after the device came back up, still no go. What was going on?

According to Amazon, it was their bad.

More specifically, it was Amazon’s own infrastructure software that broke. Hopefully, you will stay with me as I explain this in layman’s terms. See, Amazon builds its large-scale distributed systems out of smaller independent services. These services interact with each other using APIs, which allows AWS to operate them independently. So, they split the system into services that are responsible for executing customer requests (the data plane) and services that are responsible for managing and vending customer configuration (the control plane). Amazon Elastic Compute Cloud (EC2) is an example of an architecture that includes a data plane and a control plane. The data plane consists of the physical servers where customers’ Amazon EC2 instances run.

The control plane consists of a number of services that interact with the data plane, performing functions like telling each server which EC2 instances need to run, keeping running EC2 instances up to date with Amazon Virtual Private Cloud configuration, receiving metering data, logs, and metrics emitted by the servers, and deploying new software to the servers. Two things to note about these planes: the data plane and the control plane need to stay in sync with each other, and the size of the data plane fleet exceeds the size of the control plane fleet, frequently by a factor of 100 or more.
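To make the split concrete, here is a toy sketch in Python (not AWS's actual implementation; all class and method names are invented) of a small control plane vending configuration to a much larger data plane fleet and keeping it in sync:

```python
class DataPlaneServer:
    """A data plane server: runs customer workloads and simply
    applies whatever configuration the control plane hands it."""
    def __init__(self, server_id):
        self.server_id = server_id
        self.config_version = 0

    def apply_config(self, version):
        self.config_version = version


class ControlPlane:
    """Manages and vends configuration to the data plane fleet,
    which is typically ~100x larger than the control plane itself."""
    def __init__(self, fleet):
        self.fleet = fleet
        self.latest_version = 0

    def push_config(self):
        """Bump the config version and propagate it to every server,
        so the two planes stay in sync."""
        self.latest_version += 1
        for server in self.fleet:
            server.apply_config(self.latest_version)


fleet = [DataPlaneServer(i) for i in range(100)]
control = ControlPlane(fleet)
control.push_config()
```

The point of the separation is that customer workloads (the data plane) keep running even if the configuration machinery (the control plane) is degraded; they only drift out of sync until the control plane recovers.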

Ok, so now that you know what control and data planes are, keep reading for Amazon’s explanation of what happened.

To explain this event, we need to share a little about the internals of the AWS network. While the majority of AWS services and all customer applications run within the main AWS network, AWS makes use of an internal network to host foundational services including monitoring, internal DNS, authorization services, and parts of the EC2 control plane. Because of the importance of these services in this internal network, we connect this network with multiple geographically isolated networking devices and scale the capacity of this network significantly to ensure high availability of this network connection. These networking devices provide additional routing and network address translation that allow AWS services to communicate between the internal network and the main AWS network. At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.
— Amazon spokesperson, AWS
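The amplification AWS describes — errors triggering immediate retries, which add more load, which causes more errors — is a classic retry storm. A common mitigation (not something AWS said it lacked here; this is general practice) is exponential backoff with jitter, so a fleet of clients doesn't hammer a struggling service in lockstep. A minimal sketch, with all names invented:

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=30.0):
    """Full-jitter exponential backoff: pick a random delay between
    0 and min(cap, base * 2**attempt) before the next retry."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retries(operation, max_attempts=5):
    """Run a flaky operation, sleeping a jittered, exponentially
    growing delay between attempts instead of retrying immediately."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure
            time.sleep(backoff_delay(attempt))
```

The randomness matters as much as the exponent: without jitter, every client that failed at the same moment retries at the same moment, recreating the surge.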

Services Affected Last Week

So, to be clear, Amazon’s US-EAST-1 region was the one affected. This caused plenty of issues, especially with Amazon’s own delivery services. Amazon drivers had no idea where their next stop was because the app they used was connected to the infrastructure that was down.

If you were playing PUBG, you would have been kicked and unable to reconnect for over 6 hours.

A lot of home automation was affected. People took to Downdetector.com to complain that Ring doorbells were not functioning correctly, and Amazon’s Echo devices would spit out ‘sorry, I can’t help you right now, try again later’.

Roombas went berserk and rose up against their owners.

Even at Walt Disney World, guests had trouble making reservations for different services and checking how long the lines were for different attractions.

Surprisingly, even Google was affected. (Remember, AWS was around before Google got into the cloud storage business and teamed up with Amazon’s S3 product.)

Venmo was affected: if you wanted to check your balance or send and receive money, the app would open and then fail.

People got pretty hangry when their DoorDash food never arrived.

Even the sleazy crypto trading app Robinhood was affected. (Hopefully this put the final nail in their coffin. Time will tell.)

Frequently visited government websites, such as My Social Security—the portal for online accounts at the U.S. Social Security Administration—also reported disruptions.


Key Takeaways
